Chapter 9 - clustering: computer exercises

Author

Prof. Richard Wilkinson

Task 1

Perform hierarchical clustering with single, complete and average linkage using the iris data. You could also look at other methods, such as Ward’s method (see ?hclust for details).

iris.single <- hclust(dist(iris[,1:4],method="euclidean"),method="single")
plot(iris.single, labels=iris$Species,cex=0.7)

iris.complete <- hclust(dist(iris[,1:4],method="euclidean"),method="complete")
plot(iris.complete, labels=iris$Species,cex=0.7)

iris.ward <- hclust(dist(iris[,1:4],method="euclidean"),method="ward.D2")
plot(iris.ward, labels=iris$Species,cex=0.7)

iris.average <- hclust(dist(iris[,1:4],method="euclidean"),method="average")
plot(iris.average, labels=iris$Species,cex=0.7)

  • In each case, cut the dendrogram to give three distinct groups, and compute the confusion matrix comparing the clusters found with the Species label. Comment on which linkage method has worked best in this case.
cutree(iris.single, k=3)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2
table(cutree(iris.single, k=3), iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         50        48
  3      0          0         2
cutree(iris.complete, k=3)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2
 [75] 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2 2 2 2 2 3 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2
table(cutree(iris.complete, k=3), iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         23        49
  3      0         27         1
cutree(iris.ward, k=3)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2
table(cutree(iris.ward, k=3), iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         49        15
  3      0          1        35
cutree(iris.average, k=3)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2
table(cutree(iris.average, k=3), iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         50        14
  3      0          0        36

The dendrograms show that Ward’s method most clearly finds three distinct clusters. Three clusters are also clear(ish) with complete linkage, but the single and average linkage methods do not really suggest there are three natural groups: from both of those dendrograms, two clusters look more natural (setosa versus the other two species).

The confusion matrices show that when cut into three clusters, Ward’s method and average linkage do a good job of recovering the three species. Single linkage does a poor job, and complete linkage does not separate versicolor from virginica very well.
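Each confusion matrix can also be summarised by a single agreement score. As a sketch, the adjusted Rand index from the mclust package (which we load later anyway) compares a clustering against the species labels, with 1 meaning perfect agreement and values near 0 meaning chance-level agreement:

```r
library(mclust)  # provides adjustedRandIndex()

D <- dist(iris[, 1:4], method = "euclidean")
methods <- c(single = "single", complete = "complete",
             average = "average", ward = "ward.D2")

# Adjusted Rand index between each 3-cluster cut and the true species labels
sapply(methods, function(m)
  adjustedRandIndex(cutree(hclust(D, method = m), k = 3), iris$Species))
```

This ranks the four linkage methods on one scale, and matches the impression from the confusion matrices.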

  • We do not normally know the species/cluster labels when carrying out cluster analysis. Can we still say anything about which method is better if we were expecting to see three distinct groups?

It depends on what feature you are expecting to see. If we were told that there should be three groups, then complete linkage and Ward’s method both give three clearly distinct groups, whereas single and average linkage only clearly suggest two. So complete linkage and Ward’s method would appear to be better if the purpose is to form three distinct clusters.
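One label-free diagnostic is the average silhouette width of each three-cluster cut, which measures how well separated the clusters are without any reference to the species. A sketch using the cluster package:

```r
library(cluster)  # for silhouette()

D <- dist(iris[, 1:4], method = "euclidean")
for (m in c("single", "complete", "average", "ward.D2")) {
  cl <- cutree(hclust(D, method = m), k = 3)
  # Average silhouette width: closer to 1 means tighter, better-separated clusters
  cat(m, ":", round(mean(silhouette(cl, D)[, "sil_width"]), 3), "\n")
}
```

A method that produces a higher average silhouette width is, by this criterion, finding a more convincing three-cluster structure.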

  • Compare the hierarchical clustering methods with the results of doing K-means clustering and model-based clustering (assuming multivariate normal distributions for each population).

First, let’s look at K-means clustering.

set.seed(123)
iris2 <- iris[,1:4]
(iris.k <- kmeans(iris2, centers = 3, nstart=25)) 
K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
table(iris.k$cluster, iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         48        14
  3      0          2        36
library(factoextra)
fviz_cluster(iris.k, iris[, -5], ellipse.type = "norm")

We can see this is pretty good at finding the three natural clusters (note that the labels 1, 2, 3 are arbitrary), with similar performance to the hierarchical clustering answer using Ward’s method.
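The claim of similar performance can be checked directly by cross-tabulating the K-means clusters against the three-cluster Ward cut (a small sketch; the labellings from both methods are arbitrary):

```r
set.seed(123)
iris.k <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
iris.ward <- hclust(dist(iris[, 1:4]), method = "ward.D2")

# Rows: K-means clusters; columns: Ward clusters. A table that is
# block-diagonal up to relabelling means the two methods mostly agree.
table(kmeans = iris.k$cluster, ward = cutree(iris.ward, k = 3))
```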

Now let’s look at model-based clustering.

library(mclust)
iris.m <- Mclust(iris2,G=3)
table(iris.m$classification, iris$Species)
   
    setosa versicolor virginica
  1     50          0         0
  2      0         45         0
  3      0          5        50
plot(iris.m, what = c("classification"))

plot(iris.m, what = c("density"))

In terms of finding the natural clusters, this has worked spectacularly well. Both model-based clustering and K-means assume, in effect, that each cluster has a multivariate normal distribution, but K-means assumes all clusters share a covariance matrix that is a multiple of the identity. In contrast, the model-based clustering algorithm used here allows each cluster its own (non-diagonal) covariance matrix. We can see from the projections that the estimated covariances differ across the three clusters, which is why model-based clustering outperforms K-means in this case.

If you look at the Mclust help page, you will see that it tries a number of different models for multivariate data. The details of these models are found by typing

?mclustModelNames

For example, the model abbreviated as EII assumes a common spherical covariance matrix for all populations. In contrast, VVV assumes a completely different covariance matrix for each cluster. The Mclust command fits 14 different models, and then evaluates the BIC (think of this as a score for each model, with the largest value being best). In this case

iris.m$BIC
Bayesian Information Criterion (BIC): 
       EII       VII       EEI       VEI       EVI       VVI       EEE
3 -878.765 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647
        VEE       EVE       VVE      EEV       VEV       EVV       VVV
3 -605.3982 -666.5491 -636.4259 -644.781 -562.5522 -656.0359 -580.8396

Top 3 models based on the BIC criterion: 
    VEV,3     VVV,3     VEE,3 
-562.5522 -580.8396 -605.3982 

we can see the VEV model is best, which assumes the covariance matrices for each cluster have the same shape, but may be rotated and scaled differently for each cluster. The completely general VVV model is a close second.
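As an aside (a sketch): if we do not fix G = 3, Mclust also chooses the number of clusters by BIC, and the answer need not be three.

```r
library(mclust)

# Search over all covariance models AND numbers of clusters (default G = 1:9)
iris.full <- Mclust(iris[, 1:4])
iris.full$modelName  # chosen covariance structure
iris.full$G          # chosen number of clusters
```

This is a useful reminder that the "right" number of clusters is itself a modelling decision, not something the data settles unambiguously.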

Task 2

Download the Indian Premier League data from Moodle and load it into R. We will filter the data to keep only players who played at least 10 matches, and then select just the number of runs they scored, their highest score (HS), their batting average (Avg), the number of balls they faced (BF), their strike rate (SR), and the number of 4s and 6s they hit.

library(dplyr)
IPL<-read.csv('IPL.csv')
IPL10<-IPL |> dplyr::filter(Mat.x>=10) |>
  dplyr::select(PLAYER,Runs.x, HS, Avg.x, BF, SR.x, X4s, X6s)

Apply agglomerative hierarchical clustering to these data. You will need to first compute a distance matrix for the data, which can be done with the dist command.

  1. Using the Euclidean distance, do single linkage, complete linkage, and average linkage give similar dendrograms?
D <- dist(IPL10[,2:8], method = "euclidean")
plot(hclust(D, method="single"), labels = IPL10$PLAYER, cex=0.5)

plot(hclust(D, method="complete"), labels = IPL10$PLAYER, cex=0.5)

plot(hclust(D, method="average"), labels = IPL10$PLAYER, cex=0.5)

The complete and average linkage dendrograms are very similar, but single linkage is completely different. To confirm that complete and average linkage agree, let’s use both to create four clusters and compare the assignments.

IPL.comp <- hclust(D, method="complete")
IPL.av <-hclust(D, method="average")
table(cutree(IPL.comp, k=4), cutree(IPL.av,4))
   
     1  2  3  4
  1 21  0  0  0
  2  0 16  1  0
  3  0  0  0 16
  4  0  0  3  0

Remember that the cluster labels are arbitrary; allowing for the relabelling, the four clusters found are remarkably similar, with only a single difference in cluster assignment.

  2. Do your results change much if we use a different distance measure (e.g. Manhattan)?
D <- dist(IPL10[,2:8], method = "manhattan")
plot(hclust(D, method="single"), labels = IPL10$PLAYER, cex=0.5)

plot(hclust(D, method="complete"), labels = IPL10$PLAYER, cex=0.5)

plot(hclust(D, method="average"), labels = IPL10$PLAYER, cex=0.5)

We can see that using the Manhattan distance has not made a big difference to the dendrograms, which can be verified by comparing the cluster assignments.
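To actually carry out that check, one option (assuming the IPL10 data frame created above is still in the workspace) is to cross-tabulate the four-cluster complete-linkage cuts under the two distances:

```r
# Cross-tabulate complete-linkage cuts under Euclidean vs Manhattan distance;
# a table that is near-diagonal up to label permutation means the two
# clusterings agree.
D.euc <- dist(IPL10[, 2:8], method = "euclidean")
D.man <- dist(IPL10[, 2:8], method = "manhattan")
table(euclidean = cutree(hclust(D.euc, method = "complete"), k = 4),
      manhattan = cutree(hclust(D.man, method = "complete"), k = 4))
```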

  3. There are several R packages for creating different types of dendrogram plots. Have a look at the link here and try creating an alternative type of dendrogram. Consider whether this has helped communicate anything of interest about the data.
library("ape")
tmp <- IPL10[,2:8]
rownames(tmp)<- IPL10$PLAYER
D <- dist(tmp, method = "euclidean")
IPL.comp <- hclust(D, method="complete")
plot(as.phylo(IPL.comp), type = "cladogram", cex = 0.6, label.offset = 0.5)

plot(as.phylo(IPL.comp), type = "radial", cex=0.6)

Personally, I don’t find these variations especially useful, but it is a matter of taste. If I knew more about the IPL, I might be able to explain some of the clusters that appear here (e.g., aggressive attacking batsmen, slow scorers, bowlers, etc.).

Task 3

Look at the data stored in the USArrests data frame in R. You can read about the data by typing ?USArrests.

Apply a selection of clustering methods to these data and discuss how many clusters appear to be present.

Let’s just look at a complete linkage dendrogram and K-means clustering.

plot(hclust(dist(USArrests), method='complete'))

library(factoextra)
fviz_nbclust(USArrests, kmeans, method = "wss")

Both of these suggest there are three natural clusters here.
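The elbow plot is only one heuristic. As a cross-check (a sketch using the same factoextra helper), the average silhouette width can also be plotted against the number of clusters; the peak suggests the number of clusters with the cleanest separation.

```r
library(factoextra)

# Average silhouette width for a range of cluster counts under K-means
fviz_nbclust(USArrests, kmeans, method = "silhouette")
```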